Abstract. We study in this paper the consequences of using the Mean
Absolute Percentage Error (MAPE) as a measure of quality for regression
models. We show that finding the best model under the MAPE is
equivalent to doing weighted Mean Absolute Error (MAE) regression. We
show that universal consistency of Empirical Risk Minimization remains
possible using the MAPE instead of the MAE.

1 Introduction 


We study in this paper the classical regression setting in which we assume given
a random pair $Z = (X, Y)$ with values in $\mathcal{X} \times \mathbb{R}$, where
$\mathcal{X}$ is a metric space. The goal is to learn a mapping $g$ from
$\mathcal{X}$ to $\mathbb{R}$ such that $g(X) \simeq Y$. To judge the quality of
the regression model $g$, we need a quality measure. While the traditional
measure is the quadratic error, in some applications, a more useful measure of
the quality of the predictions made by a regression model is given by the mean
absolute percentage error (MAPE). For a target $y$ and a prediction $p$, the
MAPE is

$$l_{MAPE}(p, y) = \frac{|p - y|}{|y|},$$

with the conventions that for all $a \neq 0$, $\frac{a}{0} = \infty$ and that
$\frac{0}{0} = 1$. The MAPE-risk of $g$ is then
$L_{MAPE}(g) = \mathbb{E}(l_{MAPE}(g(X), Y))$.
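
The definition and its two conventions can be written down directly. The
following is a minimal sketch only; the function name `mape_loss` is an
illustrative choice and does not come from the paper.

```python
import math


def mape_loss(p: float, y: float) -> float:
    """Pointwise absolute percentage error |p - y| / |y|.

    Conventions taken from the text: a/0 = infinity for a != 0, and 0/0 = 1.
    """
    err = abs(p - y)
    if y == 0:
        # Division by a zero target: apply the stated conventions.
        return 1.0 if err == 0 else math.inf
    return err / abs(y)
```

Note that the constant prediction p = 0 has loss exactly 1 on every nonzero
target, whereas over-predictions are penalized without bound.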

We are interested in the consequences of choosing the best regression model
according to the MAPE as opposed to the Mean Absolute Error (MAE) or the
Mean Square Error (MSE), from both a practical and a theoretical point of
view. From a practical point of view, it seems obvious that if $g$ is chosen so
as to minimize $L_{MAPE}(g)$ it will perform better according to the MAPE than a
model selected in order to minimize $L_{MSE}$ (and worse according to the MSE).
The practical issue is rather to determine how to perform this optimization: this
is studied in Section 3. From a theoretical point of view, it is well known (see
e.g. [4]) that consistent learning schemes can be obtained by adapting the
complexity of the model class to the data size. As the complexity of a class of
models is partially dependent on the loss function, using the MAPE instead of
e.g. the MSE has some implications that are investigated in Section 4. The
following section introduces the material common to both parts of the analysis.



2 General setting 


We use a classical statistical learning setting as in e.g. [1]. We assume given
$N$ independently distributed copies of $Z$, the training set,
$D = (Z_i)_{1 \le i \le N} = (X_i, Y_i)_{1 \le i \le N}$. Given a loss function
$l$ from $\mathbb{R}^2$ to $\mathbb{R}^+ \cup \{\infty\}$, we define the risk
of a predictor $g$, a (measurable) function from $\mathcal{X}$ to $\mathbb{R}$,
as the expected loss, that is $L_l(g) = \mathbb{E}(l(g(X), Y))$. The empirical
risk is the empirical mean of the loss computed on the training set, that is:

$$\widehat{L}_l(g)_N = \frac{1}{N} \sum_{i=1}^{N} l(g(X_i), Y_i). \tag{1}$$

In addition to $l_{MAPE}$ defined in the Introduction, we use
$l_{MAE}(p, y) = |p - y|$ and $l_{MSE}(p, y) = (p - y)^2$.

3 Practical issues 

3.1 Optimization 

From a practical point of view, the problem is to minimize
$\widehat{L}_{MAPE}(g)_N$ over a class of models $G_N$, that is to solve¹

$$\widehat{g}_{l_{MAPE},N} = \arg\min_{g \in G_N} \frac{1}{N} \sum_{i=1}^{N} \frac{|g(X_i) - Y_i|}{|Y_i|}.$$

Optimization-wise, this is simply a particular case of median regression (which
is in turn a particular case of quantile regression). Indeed, the factor
$\frac{1}{|Y_i|}$ can be seen as a fixed weight and therefore, any quantile
regression implementation that supports instance weights can be used to find the
optimal model². Notice that when $G_N$ corresponds to linear models, the
optimization problem is a simple linear programming problem that can be solved
by e.g. interior point methods [2].
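
The reduction to weighted median regression is easiest to see in its simplest
instance, a constant model g(x) = c. Minimizing the empirical MAPE over c is
then a weighted median problem with weights 1/|Y_i|. The sketch below
illustrates this special case only (it is not the linear programming
formulation mentioned above); all function names are illustrative, and targets
are assumed to be nonzero.

```python
def weighted_median(values, weights):
    """Return a minimizer c of sum_i weights[i] * |c - values[i]|.

    Uses the lower weighted median: the smallest value whose cumulative
    weight reaches half of the total weight.
    """
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("empty input")


def empirical_mape(c, targets):
    """Empirical MAPE of the constant prediction c (targets must be nonzero)."""
    return sum(abs(c - y) / abs(y) for y in targets) / len(targets)


def best_constant_mape(targets):
    """Constant prediction minimizing the empirical MAPE."""
    return weighted_median(targets, [1.0 / abs(y) for y in targets])
```

For the targets 1, 2 and 10, the plain median is 2 while the weighted median
with weights 1/|y| is 1: small targets receive large weights, so the
MAPE-optimal model is pulled towards them.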

3.2 An example of typical results 

We verified on a toy example (the car data set from [3]) the effects of optimizing
the MAPE, the MAE and the MSE for a simple linear model: the goal is to
predict the distance taken to stop from the speed of the car just before braking.
There are only 50 observations, the goal being here to illustrate the effects of
changing the loss function. The results on the training set³ are summarized in
Table 1. As expected, optimizing for a particular loss function leads to the best
empirical model as measured via the same risk (or a related one). In practice this
allowed one of us to win a recent datascience.net challenge about electricity
consumption prediction⁴ which was using the MAPE as the evaluation metric.

¹ We are considering here the empirical risk minimization, but we could of course
include a regularization term. That would not modify the key point, which is the
use of the MAPE.

² This is the case of the quantreg R package [5], among others.

³ Notice that the goal here is to verify the effects of optimizing with respect to
different types of loss function, not to claim that one loss function is better
than another, something that would be meaningless. We therefore report the
empirical risk, knowing that it is an underestimation of the real risk for all
loss functions.

Loss function   NRMSE   NMAE    MAPE
MSE             0.585   0.322   0.384
MAE             0.601   0.313   0.330
MAPE            0.700   0.343   0.303

Table 1: Empirical risks of the best linear models obtained with the three loss
functions. In order to ease the comparisons between the values, we report the
Normalized Root MSE (NRMSE), that is the square root of the MSE divided by the
standard deviation of the target variable, as well as the Normalized MAE (NMAE),
that is the MAE divided by the median of the target variable.
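
The three reported quantities can be computed from a vector of predictions as
sketched below. This is an illustration under stated assumptions, not the code
used for Table 1: the function name is hypothetical, the population standard
deviation is an assumed choice (the text does not say which estimator was
used), and targets are assumed to be nonzero.

```python
import math
import statistics


def report_metrics(predictions, targets):
    """Return NRMSE, NMAE and MAPE of predictions against targets."""
    n = len(targets)
    errors = [p - y for p, y in zip(predictions, targets)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(y) for e, y in zip(errors, targets)) / n
    return {
        # Root MSE divided by the standard deviation of the target
        # (population version: an assumption).
        "NRMSE": math.sqrt(mse) / statistics.pstdev(targets),
        # MAE divided by the median of the target.
        "NMAE": mae / statistics.median(targets),
        "MAPE": mape,
    }
```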

4 Theoretical issues 


From a theoretical point of view, we are interested in the consistency of standard
learning strategies when the loss function is the MAPE. More precisely, for a
loss function $l$, we define $L_l^* = \inf_g L_l(g)$, where the infimum is taken
over all measurable functions from $\mathcal{X}$ to $\mathbb{R}$. We also denote
$L^*_{l,G} = \inf_{g \in G} L_l(g)$ where $G$ is a class of models. Then a
learning algorithm, that is a function which maps the training set
$D = (X_i, Y_i)_{1 \le i \le N}$ to a model $\widehat{g}_N$, is strongly
consistent if $L_l(\widehat{g}_N)$ converges almost surely to $L_l^*$. We are
specifically interested in the Empirical Risk Minimization (ERM) algorithm, that
is in $\widehat{g}_{l,N} = \arg\min_{g \in G_N} \widehat{L}_l(g)_N$. The class of
models has to depend on the data size, as this is mandatory to reach
consistency.

It is well known (see e.g. [1] chapter 9) that ERM consistency is related
to uniform laws of large numbers (ULLN). In particular, we need to control
quantities of the following form

$$\mathbb{P}\left\{ \sup_{g \in G_N} \left| \widehat{L}_l(g)_N - L_l(g) \right| > \epsilon \right\}. \tag{2}$$


This can be done via covering numbers or via the Vapnik-Chervonenkis dimension
(VC-dim) of certain classes of functions derived from $G_N$. One might think
that general results about arbitrary loss functions can be used to handle the case
of the MAPE. This is not the case as those results generally assume a uniform
Lipschitz property of $l$ (see Lemma 17.6 in [1], for instance) that is not
fulfilled by the MAPE.


4.1 Classes of functions 

Given a class of models, $G_N$, and a loss function $l$, we introduce derived
classes $H(G_N, l)$ given by

$$H(G_N, l) = \{h : \mathcal{X} \times \mathbb{R} \to \mathbb{R}^+,\ h(x, y) = l(g(x), y) \mid g \in G_N\},$$

and $H^+(G_N, l)$ given by

$$H^+(G_N, l) = \{h : \mathcal{X} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+,\ h(x, y, t) = \mathbb{I}_{t \le l(g(x), y)} \mid g \in G_N\}.$$

When this is obvious from the context, we abbreviate the notations into e.g.
$H_{N,MAPE}$ for $l = l_{MAPE}$ and for the $G_N$ under study.

⁴ https://datascience.net/fr/challenge/16/details


4.2 Covering numbers 

4.2.1 Supremum covering numbers

Let $\epsilon > 0$. A size $p$ supremum $\epsilon$-cover of a class of positive
functions $F$ from an arbitrary set $\mathcal{Z}$ to $\mathbb{R}^+$ is a finite
collection $f_1, \ldots, f_p$ of $F$ such that for all $f \in F$

$$\min_{1 \le i \le p} \sup_{z \in \mathcal{Z}} |f(z) - f_i(z)| < \epsilon.$$

Then the supremum $\epsilon$-covering number of $F$,
$\mathcal{N}_\infty(\epsilon, F)$, is the size of the smallest supremum
$\epsilon$-cover of $F$. If such a cover does not exist, the covering number is
$\infty$. While controlling supremum covering numbers of $H(G_N, l)$ leads easily
to consistency via a uniform law of large numbers (see e.g. Lemma 9.1 in [4]),
they cannot be used with the MAPE without additional assumptions. Indeed, let
$h_1$ and $h_2$ be two functions from $H_{N,MAPE}$, generated by $g_1$ and $g_2$
in $G_N$. Then

$$\|h_1 - h_2\|_\infty = \sup_{(x, y) \in \mathcal{X} \times \mathbb{R}} \frac{\bigl| |g_1(x) - y| - |g_2(x) - y| \bigr|}{|y|}.$$

In general, this quantity will be unbounded as we cannot control the behavior
of $g(x)$ around $y = 0$ (indeed, in the supremum, $x$ and $y$ are independent
and thus, unless $G_N$ is very restricted, there are always $x$, $g_1$ and $g_2$
such that $g_1(x) \neq g_2(x) \neq 0$). Thus we have to assume that there is
$\lambda > 0$ such that $|Y| \ge \lambda$. This is not needed when using more
traditional loss functions such as the MSE or the MAE. Then we have

$$\mathcal{N}_\infty(\epsilon, H(G_N, l_{MAPE})) \le \mathcal{N}_\infty(\lambda\epsilon, H(G_N, l_{MAE})).$$
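
The comparison of the two covering numbers rests on a one-line rescaling
argument, sketched here for completeness. We write $\lambda$ for the lower bound
on $|y|$, and $h'_1$, $h'_2$ (notation introduced only for this sketch) for the
functions of $H(G_N, l_{MAE})$ generated by the same $g_1$ and $g_2$.

```latex
\[
\|h_1 - h_2\|_\infty
  = \sup_{x,\ |y| \ge \lambda}
    \frac{\bigl| |g_1(x) - y| - |g_2(x) - y| \bigr|}{|y|}
  \le \frac{1}{\lambda} \sup_{x,\ y}
    \bigl| |g_1(x) - y| - |g_2(x) - y| \bigr|
  = \frac{1}{\lambda} \, \|h'_1 - h'_2\|_\infty .
\]
```

Hence any supremum $\lambda\epsilon$-cover of $H(G_N, l_{MAE})$ induces a
supremum $\epsilon$-cover of $H(G_N, l_{MAPE})$ of the same size.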


4.2.2 Lp covering numbers

$L_p$ covering numbers are similar to supremum covering numbers but are based
on a different metric on the class of functions $F$ and are data dependent. Given
a data set $D$, we define

$$\|f_1 - f_2\|_{p,D} = \left( \frac{1}{N} \sum_{i=1}^{N} |f_1(Z_i) - f_2(Z_i)|^p \right)^{\frac{1}{p}},$$

and derive from this the associated notion of $\epsilon$-cover and of covering
number. It is then easy to show that

$$\mathcal{N}_p(\epsilon, H(G_N, l_{MAPE}), D) \le \mathcal{N}_p\Bigl(\epsilon \min_{1 \le i \le N} |Y_i|, H(G_N, l_{MAE}), D\Bigr).$$




4.3 Uniform law of large numbers 

In order to get a ULLN from a covering number of a class $F$, one needs a uniform
bound on $F$. For instance, Theorem 9.1 from [4] assumes that there is a value
$B_F$ such that for all $f \in F$ and all $z \in \mathcal{Z}$,
$f(z) \in [0, B_F]$. With classical loss functions such as MAE and MSE, this is
achieved via upper bounding assumptions on both $G_N$ and $|Y|$. In the MAPE
case, the bound on $G_N$ is needed but the upper bound on $|Y|$ is replaced by
the lower bound already needed. Let us assume indeed that for all $g \in G_N$,
$\|g\|_\infty \le B_{G_N}$. Then if $|Y| \le B_Y$, we have
$B_{H(G_N, l_{MAE})} = B_{G_N} + B_Y := B_{N,MAE}$, while if $|Y| \ge \lambda$,
we have $B_{H(G_N, l_{MAPE})} = 1 + \frac{B_{G_N}}{\lambda} := B_{N,MAPE}$.
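
Both uniform bounds follow from the triangle inequality, as the following
sketch shows. It assumes $\|g\|_\infty \le B_{G_N}$, together with
$|y| \le B_Y$ for the MAE and $|y| \ge \lambda$ for the MAPE, where $\lambda$
denotes the lower bound on $|y|$.

```latex
\[
|g(x) - y| \le |g(x)| + |y| \le B_{G_N} + B_Y ,
\qquad
\frac{|g(x) - y|}{|y|} \le \frac{|g(x)|}{|y|} + 1 \le \frac{B_{G_N}}{\lambda} + 1 .
\]
```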

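Theorem 9.1 from [4] then gives (with $B_{N,l} = B_{H(G_N, l)}$)

$$\mathbb{P}\left\{ \sup_{g \in G_N} \left| \widehat{L}_l(g)_N - L_l(g) \right| > \epsilon \right\} \le 8\, \mathbb{E}\left( \mathcal{N}_p\left(\frac{\epsilon}{8}, H(G_N, l), D\right) \right) e^{-\frac{N\epsilon^2}{128 B_{N,l}^2}}. \tag{3}$$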


The expressions of the two bounds above show that $B_Y$ and $\lambda$ play
similar roles in the exponential decrease of the right-hand side bound.
Loosening the condition on $Y$ (i.e., taking a large $B_Y$ or a small $\lambda$)
slows down the exponential decrease.
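
A small numerical sketch makes this concrete. It assumes that the exponential
factor has the form exp(-N eps^2 / (128 B^2)) of Theorem 9.1 in [4], with
B = B_G + B_Y for the MAE and B = 1 + B_G / lam for the MAPE, where lam stands
for the lower bound on |Y|. Function names are illustrative.

```python
import math


def mae_bound(b_g: float, b_y: float) -> float:
    """Uniform bound on the MAE loss class: B_G + B_Y."""
    return b_g + b_y


def mape_bound(b_g: float, lam: float) -> float:
    """Uniform bound on the MAPE loss class: 1 + B_G / lam."""
    return 1.0 + b_g / lam


def exponential_factor(n: int, eps: float, bound: float) -> float:
    """Decaying factor exp(-n eps^2 / (128 B^2)) of the deviation bound."""
    return math.exp(-n * eps ** 2 / (128.0 * bound ** 2))
```

Decreasing lam, or increasing B_Y, increases the uniform bound and therefore
brings the exponential factor closer to 1 for a fixed sample size.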

It might seem from the results on the covering numbers that the MAPE
suffers more from the bound needed on $Y$ than e.g. the MAE. This is not the
case, as bound hypotheses on $F$ are also needed to get finite covering numbers
(see the following section for an example). We can thus consider that the lower
bound on $|Y|$ plays for the MAPE a role equivalent to the one played by the
upper bound on $|Y|$ for the MAE/MSE.


4.4 VC-dimension 

A convenient way to bound covering numbers is to use VC-dimension.
Interestingly, replacing the MAE by the MAPE cannot increase the VC-dim of the
relevant class of functions.

Let us indeed consider a set of $k$ points shattered by $H^+(G_N, l_{MAPE})$,
$(v_1, \ldots, v_k)$ with $v_j = (x_j, y_j, t_j)$. Then for each
$\theta \in \{0, 1\}^k$, there is $h_\theta \in H_{N,MAPE}$ such that
$\forall j$, $\mathbb{I}_{t \le h_\theta(x, y)}(x_j, y_j, t_j) = \theta_j$. Each
$h_\theta$ corresponds to a $g_\theta \in G_N$ and
$t \le h_\theta(x, y) \Leftrightarrow |y| t \le |g_\theta(x) - y|$. Then the set
of $k$ points defined by $w_j = (x_j, y_j, |y_j| t_j)$ is shattered by
$H^+(G_N, l_{MAE})$ because the $h'_\theta$ associated in $H_{N,MAE}$ to the
$g_\theta$ are such that $\forall j$,
$\mathbb{I}_{t \le h'_\theta(x, y)}(x_j, y_j, |y_j| t_j) = \theta_j$. Therefore

$$V_{MAPE} := VCdim(H^+(G_N, l_{MAPE})) \le VCdim(H^+(G_N, l_{MAE})) := V_{MAE}.$$

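Using Theorem 9.4 from [4], we can bound the covering number with a VC-dim
based value. If $V_l = VCdim(H^+(G_N, l)) \ge 2$, $p \ge 1$, and
$0 < \epsilon < \frac{B_{N,l}}{4}$, then

$$\mathcal{N}_p(\epsilon, H(G_N, l), D) \le 3 \left( \frac{2 e B_{N,l}^p}{\epsilon^p} \log \frac{3 e B_{N,l}^p}{\epsilon^p} \right)^{V_l}. \tag{4}$$

When this bound is plugged into equation (3), it shows the symmetry between
$B_Y$ and $\lambda$ as both appear in the relevant $B_{N,l}$.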


4.5 Consistency 

Mimicking Theorem 10.1 from [4], we can prove a generic consistency result for
MAPE ERM learning. Assume given a series of classes of models, $(G_n)_{n \ge 1}$,
such that $\bigcup_{n \ge 1} G_n$ is dense in the set of measurable functions
from $\mathcal{X}$ to $\mathbb{R}$ according to the $L^1(\mu)$ metric for any
probability measure $\mu$. Assume in addition that each $G_n$ leads to a finite
VC-dim $V_n = VCdim(H^+(G_n, l_{MAPE}))$ and that each $G_n$ is uniformly bounded
by $B_{G_n}$. Notice that those two conditions are compatible with the density
condition only if $\lim_{n \to \infty} V_n = \infty$ and
$\lim_{n \to \infty} B_{G_n} = \infty$.

Assume finally that $(X, Y)$ is such that $|Y| \ge \lambda$ (almost surely) and
that $\lim_{n \to \infty} [\ldots] = 0$, then
$L_{MAPE}(\widehat{g}_{l_{MAPE},n})$ converges almost surely to $L^*_{MAPE}$,
which shows the consistency of the ERM estimator for the MAPE.

The proof is based on the classical technique of exponential bounding. Plugging
equation (4) into equation (3) gives a bound on the deviation between the
empirical mean and the expectation of

$$K(n, \epsilon) = 24 \left( \frac{16 e B_n}{\epsilon} \log \frac{24 e B_n}{\epsilon} \right)^{V_n} e^{-\frac{n \epsilon^2}{128 B_n^2}},$$

with $B_n = 1 + \frac{B_{G_n}}{\lambda}$. Then it is easy to check that the
conditions above guarantee that $\sum_{n \ge 1} K(n, \epsilon) < \infty$ for all
$\epsilon > 0$. This is sufficient to show almost sure convergence of
$L_{MAPE}(\widehat{g}_{l_{MAPE},n}) - L^*_{MAPE,G_n}$ to 0. The conclusion follows
from the density hypothesis.
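
To get a feeling for the behavior of this bound, its logarithm can be evaluated
numerically. The sketch below assumes that K has the form obtained by plugging
the VC-dim bound of Theorem 9.4 (with p = 1 and radius eps / 8) into Theorem
9.1, that is 24 (16 e B_n / eps * log(24 e B_n / eps))^{V_n}
exp(-n eps^2 / (128 B_n^2)) with B_n = 1 + B_{G_n} / lam. The growth rates
V_n = B_{G_n} = log n used in the usage note are purely hypothetical choices
and are not taken from the paper.

```python
import math


def log_K(n: float, eps: float, v_n: float, b_gn: float, lam: float) -> float:
    """Logarithm of the bound K(n, eps), computed in log-space to avoid overflow."""
    b_n = 1.0 + b_gn / lam  # uniform bound on the MAPE loss class
    cover = (16.0 * math.e * b_n / eps) * math.log(24.0 * math.e * b_n / eps)
    return (
        math.log(24.0)
        + v_n * math.log(cover)               # covering number term
        - n * eps ** 2 / (128.0 * b_n ** 2)   # exponential decrease
    )
```

With eps = 0.5, lam = 1 and the hypothetical rates above, log K is still
positive at n = 10^6 (the bound is vacuous) and becomes strongly negative at
n = 10^9: the exponential term only dominates for very large samples.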

5 Conclusion 

We have shown that learning under the Mean Absolute Percentage Error is
feasible from both a practical and a theoretical point of view. In application
contexts where this error measure is adapted (in general when the target variable
is positive by design and remains quite far away from zero, e.g. in price prediction
for expensive goods), there is therefore no reason to use the Mean Square Error
(or another measure) as a proxy for the MAPE. An open theoretical question
is whether the symmetry between the upper bound on $|Y|$ for MSE/MAE and
the lower bound on $|Y|$ for the MAPE is strong enough to allow results such as
Theorem 10.3 in [4], in which a truncated estimator is used to lift the
boundedness hypothesis on $|Y|$.